ABSTRACT

The primary objective of this study was to investigate how 
incorporating prior information improves estimation of item parameters in two 
small samples. The factors that were investigated were sample size and the 
type of prior information. To investigate the accuracy with which item 
parameters in the Law School Admission Test (LSAT) are estimated, the item 
parameter estimates were compared with known item parameter values. By 
randomly drawing small samples of varying sizes from the population of test 
takers, the relationship between sample size and the accuracy with which item 
parameters are estimated was studied. Data used were from the Reading 
Comprehension subtest of the LSAT. Results indicate that the incorporation of 
ratings of item difficulty provided by subject matter specialists/test 
developers produced estimates of item difficulty statistics that were more 
accurate than those obtained without using such information. The improvement 
was observed for all item response models, including the model used in the 
LSAT. (SLD) 



Small Sample Estimation in Dichotomous 
Item Response Models: Effect of Priors Based on 
Judgmental Information on the Accuracy of 
Item Parameter Estimates 






Hariharan Swaminathan 
Ronald K. Hambleton 
Stephen G. Sireci 
Dehui Xing 
Saba M. Rizavi 

University of Massachusetts Amherst 






Law School Admission Council 
Computerized Testing Report 98-06 
September 2003 



LSAC 



A Publication of the Law School Admission Council 










LSAC RESEARCH REPORT SERIES 






Table of Contents 

Executive Summary 
Introduction 
Item Response Models 
Estimation of Parameters 
    Estimation of Ability Parameters 
    Estimation of Item Parameters 
        Joint Maximum Likelihood Estimation 
        Marginal Maximum Likelihood Estimation 
Design of the Study 
    Sample Size 
    Prior Information 
    Evaluation of the Accuracy of Estimation 
Results 
    Results for the One-Parameter Model 
    Results for the Two-Parameter Model 
    Results for the Three-Parameter Model 
Conclusions 
References 



Executive Summary 

The feasibility and advisability of computerized adaptive testing is currently being studied by the [...] 

For adaptive testing to be successful, it is important that a large pool of items be available with items 
whose item characteristics are known. The recent experiences of testing programs have clearly demonstrated 
[...] sample size and the specification of prior information on the accuracy with which item parameters 
are estimated. The best a priori source for information regarding the difficulty of items in a test is content specialists 
[...] information are substantial, the effects of using [...] 
demonstrated in this study and other forms of prior information be fully understood. 
Introduction 

Item response theory (IRT) provides the accepted framework for addressing the fundamental problems 
in testing: determining the proficiency level of test takers for certification and other reasons, assembly of test 
items, equating of tests, and examining the potential bias test items may exhibit toward minority or focal 
groups. In order to fully realize the advantages that item response theory offers, the parameters of item 
response models must be accurately estimated. These parameters are the ability or proficiency level 
parameter of a test taker, and the parameters that characterize the item. While estimation of the test taker 
ability parameter is the ultimate goal of testing, this goal cannot be achieved without determining the 
parameters that characterize the items. Once the item parameters are determined, items can be banked, 
and from this bank, items can be drawn and administered to test takers. 

It is well established that the efficiency of testing can be considerably increased if test takers are 
administered items that match their ability or proficiency levels (Hambleton, Swaminathan & Rogers, 1991; 
Lord, 1980). In this adaptive testing scheme, items are administered to test takers sequentially, one at a time, 
or in sets. The item or set of items administered is usually chosen in such a way that it provides maximum 
information at the ability level of the test taker. 

For adaptive testing to be successful, it is important that a large pool of items be available with items 
whose item parameters are known; that is, estimated or calibrated using a sample of test takers. The recent 
experiences of testing programs have clearly demonstrated that without a large item pool, test security can 
be seriously compromised. One way to maintain a large pool of items is to replenish the pool by 
administering experimental items to a group of test takers taking an existing test and calibrating the items. 
However, administering new or experimental items to a large group of test takers increases the exposure rate 
of these items, compromising test security. One obvious solution is to administer a set of experimental items 
to a randomly selected small group of test takers. Unfortunately, this solution raises another serious 
problem: that of estimating item parameters using small samples. 

The issue of sample size and its effect on item parameter estimation has been well studied (e.g., 
Swaminathan & Gifford, 1983). In general, large sample sizes are needed to estimate parameters, particularly 
in the two- and three-parameter item response models. The issue that needs to be addressed is that of 
estimating or calibrating items using a small sample of test takers. Swaminathan and Gifford (1982, 1985, 
1986) and Mislevy (1986) have shown that, by incorporating prior information about item parameters, not 
only can item parameters be estimated more accurately, but the estimation can be carried out with smaller 
sample sizes. The purposes of the current investigation are (i) to examine how prior information can be 
specified, and (ii) to investigate the relationship between sample size and the specification of prior 
information on the accuracy with which item parameters are estimated. 

This report consists of a brief review of item response models and the issues that surround the 
estimation of item parameters. The procedure for incorporating prior information is described. The design of 
the study for investigating the relationship between sample size and prior information is described. The 
results of the study are presented and the implications for estimating parameters are discussed. 

Item Response Models 

Dichotomous item response models are classified as one-, two-, or three-parameter models. For all these 
models, the probability of response $u_i$ ($u_i = 1$ for a correct response and 0 otherwise) to an item, given the 
item parameters and the ability level of the test taker, is specified by a cumulative probability function, $F(\cdot)$. 
The common forms of $F$ are the normal and the logistic cumulative probability functions. 

In the one-parameter model, the parameter that characterizes the item is called the item difficulty 
parameter, $b$. For a test taker with ability $\theta$, the probability of a correct response is 0.5 at $\theta = b$. The 
one-parameter model was developed by Rasch (1960), and hence is commonly referred to as the Rasch 
model. The probability of a correct response to item $i$ in the Rasch item response model is 

$$P(u_i = 1 \mid b_i, \theta) = \frac{\exp(\theta - b_i)}{1 + \exp(\theta - b_i)}. \tag{1}$$
The probability of a correct response for the two-parameter logistic model is conventionally written in 
the form 

$$P(u_i = 1 \mid a_i, b_i, \theta) = \frac{\exp[a_i(\theta - b_i)]}{1 + \exp[a_i(\theta - b_i)]}, \tag{2}$$



where $a_i$, the discrimination parameter, is proportional to the slope of the item response curve at the point of 
inflection. Whereas in the Rasch model the log-odds ratios of the item response curves define parallel lines, in the 
two-parameter logistic model the lines defined by the log-odds ratios are parallel only when the 
discrimination parameters are equal. 

Motivated by the work of Finney (1952), Birnbaum (1968) introduced the three-parameter logistic item 
response model given by the item response function 

$$P(u_i = 1 \mid c_i, a_i, b_i, \theta) = c_i + (1 - c_i)\,\frac{\exp[a_i(\theta - b_i)]}{1 + \exp[a_i(\theta - b_i)]}. \tag{3}$$



Here the lower asymptote $0 < c_i < 1$ reflects the probability with which test takers with very low ability or 
$\theta$ values respond correctly to item $i$. The parameter $c_i$ is known as the pseudo chance-level parameter, or 
simply as the guessing parameter. Empirical studies have shown that with multiple-choice items, the 
three-parameter model fits the item response data better than the Rasch or two-parameter model. 
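
The relationship among the three models can be checked numerically. The following Python sketch evaluates the item response functions of Equations 1-3 with a single function; the function name `irf` and the example parameter values are illustrative assumptions and are not taken from the study.

```python
import math

def irf(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under the logistic item response models.

    a = 1 and c = 0 give the Rasch model (Equation 1); c = 0 gives the
    two-parameter model (Equation 2); the general case is the
    three-parameter model (Equation 3).
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic

# Illustrative values only: a test taker located exactly at the item difficulty.
p_rasch = irf(theta=0.5, b=0.5)                 # 0.5 under the Rasch model
p_3pl = irf(theta=0.5, b=0.5, a=1.2, c=0.2)     # c + (1 - c) / 2 = 0.6
```

At $\theta = b_i$ the logistic term equals 0.5 in every model, so a nonzero lower asymptote raises the probability of a correct response at that point from 0.5 to $(1 + c_i)/2$.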

The item response models described above assume that a single dimension $\theta$ underlies the test takers' 
responses to a set of items. The assumption of unidimensionality is an issue of some concern in the 
measurement literature. While multidimensional item response models have been formulated, the 
estimation problems associated with multidimensional models are far from being solved. Hence, only the 
estimation issues concerning unidimensional item response models are discussed in this study. 

Estimation of Parameters 



Estimation of Ability Parameters 

The parameter of ultimate importance in educational testing is the test taker's ability or proficiency level 
$\theta$. If the item parameters are known a priori, the estimation of $\theta$ is straightforward. Let $U = [u_1, u_2, \ldots, u_n]$ denote 
the $(n \times 1)$ vector of responses of a test taker to $n$ items. In order to express the joint distribution of $U$ in a 
tractable form, the assumption of conditional independence, or local independence, is made. Assuming 
that the complete latent space is specified, that is, the number of dimensions that underlie the responses of the 
population of test takers to a set of items is correctly specified, it can be shown (Anderson, 1959; Lord & 
Novick, 1968) that the responses of a test taker to $n$ items, conditional on ability, are independent, that is, 

$$P(u_1, u_2, \ldots, u_n \mid \theta, \xi) = \prod_{i=1}^{n} P(u_i \mid \theta, \xi), \tag{4}$$
where $\theta$ is the $(r \times 1)$ vector of abilities, and $\xi$ is the vector of item parameters. When it is assumed that $r = 1$, 
Equation 4 holds for unidimensional item response models. Thus, the likelihood function of the observed 
item responses for a test taker given the item parameters and, consequently, the maximum of the likelihood 
function are immediately obtained. The maximum likelihood (ML) estimator of $\theta$ can be shown to possess 
the usual properties of ML estimators with increasing test length (Birnbaum, 1968). 
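
As a minimal sketch of ML ability estimation with known item parameters, the Python code below maximizes the likelihood implied by local independence for the two-parameter logistic model. The bisection search, the search interval, and the names `p_2pl` and `ml_ability` are assumptions made for illustration; the report does not specify a numerical method.

```python
import math

def p_2pl(theta, a, b):
    """Two-parameter logistic item response function (Equation 2)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def ml_ability(responses, a_params, b_params, lo=-6.0, hi=6.0, tol=1e-8):
    """ML estimate of theta for known item parameters, found by bisection.

    The derivative of the log-likelihood is sum_i a_i * (u_i - P_i(theta)),
    which decreases in theta, so its root is the ML estimate.
    """
    def score(theta):
        return sum(a * (u - p_2pl(theta, a, b))
                   for u, a, b in zip(responses, a_params, b_params))

    # All-correct or all-incorrect patterns have no finite ML estimate.
    if score(lo) <= 0.0 or score(hi) >= 0.0:
        raise ValueError("no ML estimate inside [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Two items with a = 1 and difficulties -1 and 1; only the easier item is answered correctly.
theta_hat = ml_ability([1, 0], [1.0, 1.0], [-1.0, 1.0])  # close to 0 by symmetry
```

The error raised for an all-correct or all-incorrect pattern reflects a known limitation of the ML estimator on short tests, and is one motivation for the Bayesian estimators of ability.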

Within a Bayesian framework, if the prior density of $\theta$ is $g(\theta \mid \tau)$, where $\tau$ is the vector of known 
parameters, then the posterior marginal density of $\theta$, $P(\theta \mid U, \tau)$, contains all the information about the 
parameter $\theta$. The posterior mode or the mean may be taken as a point estimate of $\theta$. When $\tau$ is not known, 
the hierarchical procedure suggested by Lindley and Smith (1972) may be applied. Swaminathan and 
Gifford (1982, 1985, 1986) applied a two-stage procedure to obtain the joint posterior density of the abilities 
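of $N$ test takers. They assumed that in the first stage the abilities were independently and identically 
normally distributed, 

$$\theta_j \mid \mu, \phi \sim N(\mu, \phi), \quad j = 1, \ldots, N. \tag{5}$$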
In the second stage, they assumed that $\mu$ was uniform and $\phi$ had the inverse chi-square density with 
parameters $\nu$ and $\lambda$. The parameters $\mu$ and $\phi$ were integrated out of the joint posterior density. The joint 
modes of the joint posterior density were taken as point estimates of the abilities of the test takers. The joint 
posterior modes, being weighted combinations of the individual's estimate and the mean of the group, provide 
more stable estimates of the ability parameters than the mode or the mean of the one-stage Bayes procedure. 
Because of the complex form of the joint density, Swaminathan and Gifford did not obtain the marginal 
density or the joint means of the joint posterior density. However, in theory it is possible to obtain the 
moments of the posterior density using the approximations suggested by Tierney and Kadane (1986). 

An alternative procedure was provided by Bock and Mislevy (1982), who used the mean of the posterior 
distribution of $\theta$ rather than the mode. This expected a posteriori (EAP) estimate was obtained using a single-stage 
procedure, assuming a priori that $\theta$ had the standard normal distribution; that is, with mean zero and unit 
standard deviation. 
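
The EAP estimate can be sketched in a few lines of Python. The code below assumes a two-parameter logistic model with known item parameters and approximates the posterior mean on an equally spaced grid under a standard normal prior; the grid, the number of points, and the function name `eap_ability` are illustrative choices and are not the quadrature scheme of Bock and Mislevy (1982).

```python
import math

def eap_ability(responses, a_params, b_params, n_points=81, lo=-4.0, hi=4.0):
    """Posterior mean of theta under a standard normal prior (grid approximation)."""
    step = (hi - lo) / (n_points - 1)
    numerator = 0.0
    denominator = 0.0
    for k in range(n_points):
        theta = lo + k * step
        # Likelihood of the response pattern at this grid point (local independence).
        likelihood = 1.0
        for u, a, b in zip(responses, a_params, b_params):
            p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
            likelihood *= p if u == 1 else (1.0 - p)
        # Unnormalized N(0, 1) density; the constant cancels in the ratio.
        weight = likelihood * math.exp(-0.5 * theta * theta)
        numerator += theta * weight
        denominator += weight
    return numerator / denominator

eap_mixed = eap_ability([1, 0], [1.0, 1.0], [-1.0, 1.0])        # close to 0 by symmetry
eap_all_correct = eap_ability([1, 1], [1.0, 1.0], [-1.0, 1.0])  # finite and positive
```

Unlike the ML estimate, the EAP estimate remains finite for an all-correct response pattern because the prior pulls the estimate toward the prior mean of zero.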

Estimation of Item Parameters 

While the estimation of ability parameters with known item parameters is relatively straightforward, the 
item parameters must be known or estimated from a calibration sample. If the ability parameters are known, 
then the item response model becomes a special case of quantal response models, and the estimation of item 
parameters is again straightforward. However, in general, neither the item parameters nor the ability 
parameters are known beforehand. 

Joint Maximum Likelihood Estimation 

The joint estimation of item and ability parameters was proposed by Lord (1953) and Birnbaum (1968). 
The joint likelihood function of the item and ability parameters, when responses of $N$ test takers on $n$ items 
are observed, is given by the expression 

$$L(U_1, U_2, \ldots, U_j, \ldots, U_N \mid \theta, \xi) = \prod_{j=1}^{N} \prod_{i=1}^{n} P(u_{ji} \mid \theta_j, \xi), \tag{6}$$



where $U_j = [u_{j1}, u_{j2}, \ldots, u_{jn}]$ is the vector of responses of test taker $j$ on $n$ items. It is assumed that the 
complete latent space is unidimensional, that is, local independence holds. 

An examination of the item response models given in Equations 1-3 reveals that the parameters $\alpha$ ($a$ 
parameter), $\beta$ ($b$ parameter), and $\theta$ are not identified. Linear transformations leave the item response 
functions invariant, and hence the metric of $\theta$ (or $\beta$) must be fixed. For convenience, the mean and standard 
deviation of $\theta$ (or $\beta$) are usually set at 0 and 1, respectively. In the Rasch model, only the mean of $\theta$ (or $\beta$) 
needs to be fixed. Once the metric of $\theta$ is fixed, starting with provisional values of $\theta$, the item parameters are 
estimated by the conventional probit or logit analysis. The item parameters are held fixed at these values, 
and the values of $\theta$ are re-estimated. This process is repeated until convergence. 
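
The lack of identification can be illustrated with a short Python sketch. It assumes the two-parameter logistic form of Equation 2 and applies a linear transformation with slope `k` and intercept `m` (both hypothetical values chosen for illustration) to the ability and difficulty while dividing the discrimination by `k`.

```python
import math

def p_2pl(theta, a, b):
    """Two-parameter logistic item response function (Equation 2)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def rescale(theta, a, b, k, m):
    """Apply theta* = k*theta + m, b* = k*b + m, and a* = a/k, with k > 0."""
    return k * theta + m, a / k, k * b + m

theta, a, b = 0.3, 1.5, -0.4                              # illustrative values
theta_s, a_s, b_s = rescale(theta, a, b, k=2.0, m=1.0)

p_original = p_2pl(theta, a, b)
p_rescaled = p_2pl(theta_s, a_s, b_s)   # same probability on the new metric
```

Because $a^*(\theta^* - b^*) = a(\theta - b)$, the data cannot distinguish the two sets of parameter values, which is why the mean and standard deviation of $\theta$ (or $\beta$) are fixed before estimation.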

The joint maximum likelihood estimation of item and ability parameters suffers from a major drawback. 
The ability parameters are incidental parameters while the item parameters are structural parameters. 
Neyman and Scott (1948) have shown that the ML estimates of the structural parameters are not consistent in 
the presence of incidental parameters. While consistent ML estimators of item parameters are not available 
in the presence of unknown ability parameters for a finite number of items, Haberman (1977) showed that 
consistent estimates of the Rasch item parameters are obtained as the number of items and the number of test 
takers increase without limit. Similar results are not available for the two- and three-parameter logistic models. 
Nevertheless, Swaminathan and Gifford (1983) demonstrated empirically through a series of simulation 
studies that the estimates of item parameters in the three-parameter model are consistent when the number of 
items and the number of test takers increase without bound. This empirical finding, although not totally 
satisfactory, provides some justification for using joint ML estimation with large numbers of items. 

Neyman and Scott (1948) also showed that if a minimal sufficient statistic is available for the incidental 
parameters, conditional maximum likelihood estimators can be devised for the structural parameters. The 
conditional maximum likelihood estimators enjoy the usual properties of maximum likelihood estimators. A 
minimal sufficient statistic for the ability parameter is available only for the Rasch model. The total score, $r$, 
obtained by summing the item scores, is a minimal sufficient statistic for the ability parameter in the Rasch 
model. By conditioning on $r$, Andersen (1970) obtained conditional maximum likelihood estimates of the 
item parameters. This procedure requires the computation of certain symmetric functions and becomes 
computationally tedious when the number of items is large. 
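
The symmetric functions mentioned above can be sketched in Python. The code assumes they are the elementary symmetric functions of $\exp(-b_i)$, as in the standard conditional likelihood for the Rasch model; the recursion and the function names are illustrative and are not taken from Andersen (1970).

```python
import math

def elementary_symmetric(b_params):
    """Elementary symmetric functions gamma_0..gamma_n of eps_i = exp(-b_i)."""
    eps = [math.exp(-b) for b in b_params]
    gamma = [1.0] + [0.0] * len(eps)
    for i, e in enumerate(eps, start=1):
        # Update from high order to low so each item enters each term once.
        for r in range(i, 0, -1):
            gamma[r] += e * gamma[r - 1]
    return gamma

def conditional_pattern_probability(responses, b_params):
    """P(response pattern | total score r) in the Rasch model; theta cancels."""
    r = sum(responses)
    gamma = elementary_symmetric(b_params)
    numerator = math.exp(-sum(u * b for u, b in zip(responses, b_params)))
    return numerator / gamma[r]

gammas = elementary_symmetric([0.0, 0.0, 0.0])                    # [1, 3, 3, 1]
p_pattern = conditional_pattern_probability([1, 1, 0], [0.0, 0.0, 0.0])  # 1/3
```

The conditional probability does not involve $\theta$, which is what allows the item parameters to be estimated without estimating the abilities. The number of terms summed implicitly in each $\gamma_r$ grows combinatorially with the number of items, which is consistent with the computational burden noted in the text.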
Marginal Maximum Likelihood Estimation 

Since a minimal sufficient statistic for the ability parameter is not available for the two- and 
three-parameter logistic item response models, the conditional maximum likelihood procedure is not 
applicable for these models. Bock and Lieberman (1970) proposed the marginal maximum likelihood 
procedure to overcome the difficulties inherent in the joint maximum likelihood procedure. Whereas the 
joint ML procedure corresponds to the fixed-effects case, the marginal ML procedure corresponds to a mixed 
model in that the test takers are assumed to be a sample from a known population. The marginal likelihood 
function is 

$$L(U_1, U_2, \ldots, U_N \mid \xi) = \prod_{j=1}^{N} \int \prod_{i=1}^{n} P(u_{ji} \mid \theta, \xi)\, g(\theta \mid \tau)\, d\theta, \tag{7}$$



where $g(\theta \mid \tau)$ is the density function of $\theta$. Bock and Lieberman (1970) took the standard normal density 
function for $g(\theta \mid \tau)$ and employed Gaussian quadrature to approximate the integral in Equation 7. They 
solved the resulting likelihood equations using Fisher's method of scoring. In the Bock and Lieberman 
procedure, the evaluation of the information matrix requires summing over $2^n$ response patterns and not just 
the patterns realized in the sample. This made the procedure unwieldy and applicable only to a small 
number of items. Bock and Aitkin (1981) realized that by fixing certain terms which are functions of item 
parameters in the likelihood equations at the current values of the parameter estimates, the procedure could 
be simplified considerably and computational efficiency increased. They pointed out that the fixing of these 
terms at current values of item parameter estimates could be justified in terms of the EM algorithm of 
Dempster, Laird, and Rubin (1977). Bock and Aitkin (1981), however, noted that their algorithm is not strictly 
the same as the general EM algorithm. For random variables in the models not belonging to the exponential 
family, Dempster, Laird, and Rubin (1977) take the expected value of the logarithm of the likelihood function 
while Bock and Aitkin take the expected value of the likelihood function. It should be pointed out that the 
Bock and Aitkin application of the EM algorithm was not the first application of this algorithm to item 
response models. Sanathanan and Blumenthal (1978) applied the EM algorithm to the Rasch model to 
estimate the parameters $\tau$ of $g(\theta \mid \tau)$. Their procedure, however, is restricted to the Rasch model and does not 
generalize. 

Rigdon and Tsutakawa (1983) and Tsutakawa (1984) applied an extended form of the EM algorithm 
appropriate when the random variables in the models do not belong to the exponential family. They applied 
the procedure developed by Dempster, Rubin, and Tsutakawa (1981) for estimating linear effects in mixed 
models to obtain marginal maximum likelihood estimates of item parameters in the one- and two-parameter 
item response models. They also provided simplified computational procedures for estimating the item 
parameters, the ability parameters, and the variance of the ability distribution. 
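
The marginal likelihood of Equation 7 can be approximated numerically, as the following Python sketch shows. It assumes a two-parameter logistic model and a standard normal ability density, and it replaces Gaussian quadrature with a simpler equally spaced grid whose weights are normalized to sum to one; this substitution and the function name `marginal_loglik` are assumptions made to keep the example self-contained.

```python
import math

def marginal_loglik(data, a_params, b_params, n_points=41, lo=-4.0, hi=4.0):
    """Log of the marginal likelihood, with theta integrated out on a grid."""
    step = (hi - lo) / (n_points - 1)
    nodes = [lo + k * step for k in range(n_points)]
    weights = [math.exp(-0.5 * t * t) for t in nodes]   # N(0, 1) shape
    total = sum(weights)
    weights = [w / total for w in weights]              # normalize to sum to 1

    loglik = 0.0
    for responses in data:                              # one response vector per test taker
        marginal = 0.0
        for t, w in zip(nodes, weights):
            likelihood = 1.0
            for u, a, b in zip(responses, a_params, b_params):
                p = 1.0 / (1.0 + math.exp(-a * (t - b)))
                likelihood *= p if u == 1 else (1.0 - p)
            marginal += w * likelihood
        loglik += math.log(marginal)
    return loglik

# One item with b = 0: the marginal probability of a correct response is 0.5.
ll_single = marginal_loglik([[1]], [1.0], [0.0])        # log(0.5)
```

Marginal ML estimation maximizes this function over the item parameters. Summing the marginal probabilities over all $2^n$ response patterns gives one, which is the enumeration that made the Bock and Lieberman information matrix costly for long tests.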

Bayesian procedures. While the marginal maximum likelihood procedures have theoretical advantages 
over the joint maximum likelihood procedures, the estimates of the discrimination parameter $\alpha$ and the 
chance-level parameter $\gamma$ pose considerable problems in that these parameters are often poorly estimated 
and the estimates frequently drift off into inadmissible regions. Bayesian procedures show considerable 
promise in terms of their ability to successfully address these issues. 

Bayesian procedures for estimating item parameters were proposed by Swaminathan and Gifford, who, 
in a series of papers (Swaminathan & Gifford, 1982, 1985, 1986), provided a hierarchical procedure for the 
one-, two-, and three-parameter models based on the Lindley-Smith approach (Lindley & Smith, 1972). They 
assumed that the item difficulty parameters and the ability parameters are exchangeable and obtained the 
joint posterior density of the item and ability parameters, marginalized with respect to the parameters of the ability 
and item difficulty distributions, that is, 



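$$p(\xi, \theta \mid U, \delta, \zeta) \propto L(U \mid \theta, \xi) \iint p(\theta \mid \tau)\, p(\xi \mid \eta)\, p(\eta \mid \zeta)\, p(\tau \mid \delta)\, d\tau\, d\eta, \tag{8}$$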
where $U$ contains the responses of $N$ test takers on $n$ items. In particular, Swaminathan and Gifford assumed 
that the parameters $\beta_i$ were independently and identically normally distributed with mean $\mu$ and variance $\phi$, 
the parameter $\alpha_i$ had a chi density with parameters $\nu$ and $\omega$; and the parameter $\gamma_i$ had a beta density with 
parameters $p$ and $q$. They also provided procedures for specifying the parameters of the prior distributions. 
Swaminathan and Gifford obtained joint modal estimates of the posterior distribution using the Newton-Raphson 
procedure to solve the modal equations. Their results were promising in that the drift of the parameter 
estimates was arrested and the parameters were estimated more accurately than with the joint ML procedure. 
A problem with the Bayesian approach of Swaminathan and Gifford was that it was not free from the 
criticisms that faced joint estimation of the parameters. Another problem was that different forms of prior 
distributions had to be specified for the various item parameters given the varying nature of the item 
parameters. One solution to this problem is to specify a multi-parameter density for the priors such as a 
multi-parameter beta distribution for the item parameters. 

Mislevy (1986), Tsutakawa (1992), and Tsutakawa and Lin (1986) have provided a marginalized Bayes 
modal estimation procedure by integrating out the ability parameters and using the EM algorithm to 
estimate the parameters. Mislevy (1986) has suggested transforming the discrimination and chance-level 
parameters, so that a multivariate normal prior for the item parameters can be specified. The specification of 
multivariate normal priors for the item parameters removes the problem inherent in the separate prior 
specifications proposed by Swaminathan and Gifford. However, Bayes modal estimates are not invariant 
with respect to transformations and hence the Bayes modal estimates of the transformed parameters cannot 
be transformed back to the original metric of the parameters. Nevertheless, the marginalized Bayes 
procedure of Mislevy is an improvement over the joint procedure of Swaminathan and Gifford. The 
marginalized Bayes procedure is currently implemented in the BILOG program (Mislevy & Bock, 1990) for 
estimating item parameters in the dichotomous case, albeit with separate forms for the priors: a normal 
prior for $\beta$, a lognormal prior for $\alpha$, and a beta prior for $\gamma$. The procedure suggested by Tsutakawa (1992) 
and Tsutakawa and Lin (1986) for specifying priors is basically different from that suggested by 
Swaminathan and Gifford and Mislevy. Tsutakawa and Lin (1986) suggested an ordered bivariate beta 
distribution for the item response function at two ability levels, while Tsutakawa (1992) suggested the 
ordered Dirichlet prior on the entire item response function. These approaches are promising, but no 
extensive research has been done to date comparing these approaches with other Bayesian approaches. 
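
To make the separate prior forms concrete, the Python sketch below evaluates the log of a joint prior density built from a normal prior for the difficulty, a lognormal prior for the discrimination, and a beta prior for the chance-level parameter. The function name `log_prior` and every default hyperparameter value are hypothetical choices for illustration; they are not the defaults of BILOG or the values used in this study.

```python
import math

def log_prior(a, b, c, mu_b=0.0, sd_b=1.0, mu_log_a=0.0, sd_log_a=0.5, p=5.0, q=17.0):
    """Log prior density: normal for b, lognormal for a, beta(p, q) for c.

    All hyperparameter defaults are hypothetical illustration values.
    """
    if a <= 0.0 or not (0.0 < c < 1.0):
        return float("-inf")          # inadmissible region has zero prior density
    log_normal_b = (-0.5 * math.log(2.0 * math.pi * sd_b ** 2)
                    - (b - mu_b) ** 2 / (2.0 * sd_b ** 2))
    log_lognormal_a = (-math.log(a) - 0.5 * math.log(2.0 * math.pi * sd_log_a ** 2)
                       - (math.log(a) - mu_log_a) ** 2 / (2.0 * sd_log_a ** 2))
    log_beta_const = math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)
    log_beta_c = (p - 1.0) * math.log(c) + (q - 1.0) * math.log(1.0 - c) - log_beta_const
    return log_normal_b + log_lognormal_a + log_beta_c

lp_admissible = log_prior(a=1.0, b=0.0, c=0.2)     # finite
lp_inadmissible = log_prior(a=-0.3, b=0.0, c=0.2)  # -inf
```

In Bayes modal estimation a term of this kind is added to the log-likelihood before maximization, so parameter values in inadmissible regions, such as a negative discrimination, can never be selected as estimates.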

Design of the Study 

The primary objective of this study was to investigate estimation of item parameters in small samples 
and to determine the specification of prior information that will result in accurate estimation in small 
samples. Given this, the factors that were investigated were sample size and type of prior information. These 
two factors were examined with respect to the accuracy with which item parameters were estimated in the 
one-, two-, and three-parameter item response models. 

In order to investigate the accuracy with which item parameters are estimated, it is necessary to 
compare the item parameter estimates with the "true" item parameter values. Typically, such an 
investigation is carried out using simulated data since true values of item and ability parameters cannot be 
known a priori. With simulated data, general conditions can be simulated. One drawback, however, is that 
the item parameter values selected for the study may not conform to real testing situations. More importantly, 
the distributions of ability and item parameters may conform too closely to the priors when 
Bayesian procedures are investigated, possibly limiting the generalizability of the results to real data. 

Fortunately, the estimation procedures can be investigated with real data: in this study, Law School 
Admission Test (LSAT) data from the Law School Admission Council (LSAC). Since the test was 
administered to a large group of test takers, calibrating the items with the entire population of test takers 
will yield true item parameters. With small samples randomly drawn from the population of test takers, 
varying the sample size and estimating the item parameters will yield the relationship between sample size 
and the accuracy with which item parameters are estimated. Moreover, the Bayesian procedures will yield 
untainted information regarding the effects of prior specifications on the estimates. 

Parameter estimation in the three item response models was investigated in this study. In order to 
obtain true parameter values for the parameters in the one-, two-, and three-parameter models, each model 
was fitted to the data for the LSAT Reading Comprehension section. Only the 21 items for which judges 
provided ratings of difficulty were used. The estimates corresponding to the relevant parameters in each 
model were taken as the true values. 

Sample Size 

One of the primary concerns in calibration is the minimal sample size that is needed to provide reasonably 
accurate estimates of item parameters. Hence, one of the factors that was examined in the study was sample 
size. Sample size was varied from a relatively small sample (n = 100) to a modest sample size (n = 500). Six 
levels of sample size were used in this study: 100, 150, 200, 300, 400, and 500. These sample sizes were 
chosen so that the effect of prior information could be studied carefully in a narrow range of sample 
size values. 
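
The sampling step of the design can be sketched in Python as follows. The six sample sizes are those listed above; drawing without replacement, the fixed seed, the population size of 5000, and the function name `draw_calibration_samples` are assumptions made for illustration, since the report does not give these details here.

```python
import random

SAMPLE_SIZES = [100, 150, 200, 300, 400, 500]   # the six levels used in the study

def draw_calibration_samples(population_size, sizes=SAMPLE_SIZES, seed=0):
    """Return {sample size: sorted test-taker indices drawn without replacement}."""
    rng = random.Random(seed)
    return {n: sorted(rng.sample(range(population_size), n)) for n in sizes}

# population_size=5000 is a placeholder, not the number of LSAT test takers.
samples = draw_calibration_samples(population_size=5000)
```

Each set of indices would select the response vectors used to calibrate the items at that sample size, and the resulting estimates would then be compared with the values obtained from the full population.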